Stylistic Fingerprints, POS-tags, and Inflected Languages: A Case Study in Polish

Abstract

In stylometric investigations, frequencies of the most frequent words (MFWs) and character n-grams outperform other style-markers, even if their performance varies significantly across languages. In inflected languages, word endings play a prominent role, and hence different word forms cannot be recognized using generic text tokenization. Countless inflected word forms make frequencies sparse, making most statistical procedures complicated. Presumably, applying one of the NLP techniques, such as lemmatization and/or parsing, might increase the performance of classification. The aim of this paper is to examine the usefulness of grammatical features (as assessed via POS-tag n-grams) and lemmatized forms in recognizing authorial profiles, in order to address the underlying issue of the degree of freedom of choice within lexis and grammar. Using a corpus of Polish novels, we performed a series of supervised authorship attribution benchmarks, in order to compare the classification accuracy for different types of lexical and syntactic style-markers. Even if the performance of POS-tags as well as lemmatized forms was notoriously worse than that of lexical markers, the difference was not substantial and never exceeded ca. 15%.
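As a rough illustration of the kind of feature the abstract refers to, the sketch below turns a sequence of POS tags into relative frequencies of tag n-grams. It is not the authors' pipeline: the function name `ngram_profile`, its parameters, and the toy tag sequence are assumptions made for this example, and a real study would obtain the tags from a morphosyntactic tagger run over each novel.

```python
from collections import Counter


def ngram_profile(tokens, n=2, top=None):
    """Return relative frequencies of n-grams over a token sequence.

    `tokens` can hold POS tags, lemmas, or word forms. `top` optionally
    keeps only the most frequent n-grams (hypothetical parameter).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Build overlapping n-grams, e.g. ["subst fin", "fin adj", ...] for n=2.
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return {}
    counts = Counter(grams)
    total = len(grams)
    # Frequencies are relative to all n-grams in the text, not only the kept ones.
    return {gram: count / total for gram, count in counts.most_common(top)}


# Toy, hand-written tag sequence; the labels are placeholders, not real tagger output.
tags = ["subst", "fin", "adj", "subst", "fin", "adj", "subst"]
profile = ngram_profile(tags, n=2)
```

The same function can be applied to lemmas or raw word forms with `n=1` to obtain MFW-style frequency profiles, so the three marker types can be compared on equal footing by whichever classifier is used downstream.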

Similar resources

Assessing Political Stability and Instability in Central Asia and Caucasus; Case Study, Azerbaijan and Kyrgyzstan

The Central Asia and Caucasus region is of particular importance, both as a historical region and because of its vast hydrocarbon reserves. The countries of this region suffer from major sources of instability, including geographical, economic, security, social, and political factors. After the collapse of the Soviet Union, the countries of the region gained an unsought independence, which multiplied these problems several times over. In this process, some of these...

Projecting POS Tags And Syntactic Dependencies From English And French To Polish In Aligned Corpora

This paper presents the first step to project POS tags and dependencies from English and French to Polish in aligned corpora. Both the English and French parts of the corpus are analysed with a POS tagger and a robust parser. The English/Polish bi-text and the French/Polish bi-text are then aligned at the word level with the GIZA++ package. The intersection of IBM-4 Viterbi alignments for both ...

Tagset Design and Inflected Languages

An experiment designed to explore the relationship between tagging accuracy and the nature of the tagset is described, using corpora in English, French and Swedish. In particular, the question of internal versus external criteria for tagset design is considered, with the general conclusion that external (linguistic) criteria should be followed. Some problems associated with tagging unknown word...

Optimizing Rule-Based Morphosyntactic Analysis of Richly Inflected Languages - a Polish Example

We consider finite-state optimization of morphosyntactic analysis of richly and ambiguously annotated corpora. We propose a general algorithm which, despite being surprisingly simple, proved to be effective in several applications for rulesets which do not match frequently.

Projecting POS tags and syntactic dependencies from English and French to Polish aligned corpora

Journal

Journal title: Journal of Quantitative Linguistics

Year: 2022

ISSN: 0929-6174, 1744-5035

DOI: https://doi.org/10.1080/09296174.2022.2122751